Unsupervised Transcription of Historical Documents
Abstract
We present a generative probabilistic model, inspired by historical printing processes, for transcribing images of documents from the printing press era. By jointly modeling the text of the document and the noisy (but regular) process of rendering glyphs, our unsupervised system is able to decipher font structure and more accurately transcribe images into text. Overall, our system substantially outperforms state-of-the-art solutions for this task, achieving a 31% relative reduction in word error rate over the leading commercial system for historical transcription, and a 47% relative reduction over Tesseract, Google’s open source OCR system.
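The relative reductions quoted above are ratios of word error rates (WER). As a minimal sketch of how such a figure is typically computed, assuming the standard word-level edit-distance definition of WER: the function names and the example numbers below are illustrative assumptions, not values or code from the paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix processed so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # reference word dropped
                            curr[j - 1] + 1,      # extra hypothesis word
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline_wer - system_wer) / baseline_wer


# Hypothetical numbers chosen only to illustrate the arithmetic.
print(word_error_rate("the quick brown fox", "the quik brown"))
print(relative_reduction(0.40, 0.276))
```

Under this definition, a baseline WER of 0.40 and a system WER of 0.276 would correspond to a 31% relative reduction; the absolute error rates reported in the paper itself are not given in this abstract.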
Similar resources
An Unsupervised Model of Orthographic Variation for Historical Document Transcription
Historical documents frequently exhibit extensive orthographic variation, including archaic spellings and obsolete shorthand. OCR tools typically seek to produce so-called diplomatic transcriptions that preserve these variants, but many end tasks require transcriptions with normalized orthography. In this paper, we present a novel joint transcription model that learns, unsupervised, a probabili...
Unsupervised Code-Switching for Multilingual Historical Document Transcription
Transcribing documents from the printing press era, a challenge in its own right, is more complicated when documents interleave multiple languages—a common feature of 16th century texts. Additionally, many of these documents precede consistent orthographic conventions, making the task even harder. We extend the state-of-the-art historical OCR model of Berg-Kirkpatrick et al. (2013) to handle wo...
Supervised Text Region Identification on Historical Documents
We present multi-column text region identification support for Ocular, the unsupervised historical printed document transcription project of Berg-Kirkpatrick et al. (2013). We use structured prediction with rich features defined on the input document and incorporate a transition model based on prior document layout assumptions. Our model is trained using a structured-SVM objective on a randomly...
Unsupervised Analysis of Structured Human Artifacts
By Taylor Berg-Kirkpatrick. Doctor of Philosophy in Computer Science, University of California, Berkeley. Professor Dan Klein, Chair. The presence of hidden structure in human data (including natural language but also sources like music, historical documents, and other complex artifacts) makes this data extremely difficult to analyze. In this thesis...
Improved Typesetting Models for Historical OCR
We present richer typesetting models that extend the unsupervised historical document recognition system of Berg-Kirkpatrick et al. (2013). The first model breaks the independence assumption between vertical offsets of neighboring glyphs and, in experiments, substantially decreases transcription error rates. The second model simultaneously learns multiple font styles and, as a result, is able to...